This report presents a Principal Component Analysis (PCA) of employment data from Eurostat. The goal is to:
Loading the necessary libraries for data manipulation, analysis, and visualization.
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
library(eurostat) # For downloading Eurostat data
library(FactoMineR) # For performing PCA
library(factoextra) # For visualizing PCA results
library(plotly) # For interactive plots
I Downloaded the lfsa_egan dataset and filtered it for:
- Years: 2021 to 2023
- Age
Group: 20-64 years (working-age population)
-
Sex: Both sexes (T)
-
Unit: Employment measured in thousands
(THS_PER)
# Download the 'lfsa_egan' dataset from Eurostat
data <- get_eurostat("lfsa_egan")
# Filter for the years 2021-2023, age 20-64, both sexes, and unit in thousands
data_filtered <- data %>%
filter(TIME_PERIOD >= as.Date("2021-01-01") & TIME_PERIOD <= as.Date("2023-12-31"),
age == "Y20-64",
sex == "T",
unit == "THS_PER") %>%
drop_na(values) # Remove rows with missing values
I Inspected the unique values in the citizen and geo columns to get a better understanding of the data.
unique(data$citizen)
## [1] "EU27_2020_FOR" "FOR" "NAT" "NEU27_2020_FOR"
## [5] "NRP" "STLS" "TOTAL"
unique(data$geo)
## [1] "AT" "BE" "CH" "CY" "CZ" "DE"
## [7] "DK" "EA20" "EE" "EL" "ES" "EU27_2020"
## [13] "FI" "FR" "HU" "IE" "IS" "IT"
## [19] "LU" "ME" "MT" "NL" "NO" "PT"
## [25] "RS" "SE" "SI" "SK" "UK" "BG"
## [31] "HR" "LT" "LV" "MK" "PL" "RO"
## [37] "BA" "TR"
I also created a data frame that maps each country code and categories to its corresponding full name.
# Create a named vector mapping country codes and categories to their full names
country_names <- c(
AT = "Austria", BE = "Belgium", CH = "Switzerland", CY = "Cyprus", CZ = "Czech Republic",
DE = "Germany", DK = "Denmark", EA20 = "Euro Area 20", EE = "Estonia", EL = "Greece",
ES = "Spain", EU27_2020 = "European Union 27", FI = "Finland", FR = "France", HU = "Hungary",
IE = "Ireland", IS = "Iceland", IT = "Italy", LU = "Luxembourg", ME = "Montenegro",
MT = "Malta", NL = "Netherlands", NO = "Norway", PT = "Portugal", RS = "Serbia",
SE = "Sweden", SI = "Slovenia", SK = "Slovakia", UK = "United Kingdom", BG = "Bulgaria",
HR = "Croatia", LT = "Lithuania", LV = "Latvia", MK = "North Macedonia", PL = "Poland",
RO = "Romania", BA = "Bosnia and Herzegovina", TR = "Turkey",
# Categories
EU27_2020_FOR = "EU27 Foreign Born", FOR = "Foreign Born", NAT = "National",
NEU27_2020_FOR = "Non-EU27 Foreign Born", NRP = "Non-Resident Population",
STLS = "Short-Term Labour Supply", TOTAL = "Total"
)
I decided to transform the dataset into a wide format with countries as rows and citizen categories as columns. Then I filled the missing values with 0.
pca_data <- data_filtered %>%
pivot_wider(names_from = citizen, values_from = values, values_fill = 0)
I prepared the data for PCA by removing non-numeric columns.
pca_input <- pca_data %>% select(-geo, -age, -sex, -unit, -freq, -TIME_PERIOD)
I standardized the data and perform PCA.
# Standardize the data
pca_data_scaled <- scale(pca_input)
# Perform PCA
pca_result <- PCA(pca_data_scaled, graph = FALSE)
The scree plot shows the percentage of variance explained by each principal component.
fviz_eig(pca_result)
pca_result$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 5.843290e+00 8.347556e+01 83.47556
## comp 2 1.008833e+00 1.441191e+01 97.88747
## comp 3 1.097596e-01 1.567994e+00 99.45547
## comp 4 3.049431e-02 4.356330e-01 99.89110
## comp 5 7.622619e-03 1.088946e-01 99.99999
## comp 6 5.037309e-07 7.196156e-06 100.00000
## comp 7 6.907112e-11 9.867303e-10 100.00000
Interpretation:
- The first principal
component (Dim.1) explains 83.48% of the
variance.
- The second principal component (Dim.2)
explains 14.41%.
- Together,
Dim.1 and Dim.2 account for 97.89% of the total
variance.
fviz_contrib(pca_result, choice = "var", axes = 1)
Interpretation:
The variables
EU27_2020_FOR, FOR, NAT,
NEU27_2020_FOR, and TOTAL contribute almost
equally (~17%) to the first principal component. This
component captures general employment trends across these
categories.
fviz_contrib(pca_result, choice = "var", axes = 2)
Interpretation:
The NRP (Non-Resident
Population) category dominates the second component with
98.82% contribution, indicating that Dim.2 captures
variations specific to this category.
fviz_contrib(pca_result, choice = "var", axes = 1:2)
Visualize the countries in the PCA space, colored by their country codes.
# Replace country codes with full names in pca_data$geo
pca_data$geo <- country_names[pca_data$geo]
# Convert geo to a factor
pca_data$geo <- as.factor(pca_data$geo)
# Create the PCA plot with country names
pca_plot <- fviz_pca_ind(
pca_result,
label = "none",
habillage = pca_data$geo, # This now contains the full country names
addEllipses = TRUE
) +
labs(
title = "PCA of Employment Data by Country (2021-2023) - Interactive Plot",
x = "Principal Component 1 (83.5% Variance Explained)",
y = "Principal Component 2 (14.4% Variance Explained)",
)
# Convert the plot to an interactive plot
interactive_plot <- ggplotly(pca_plot)
# Manually update the legend labels to remove ",1"
for (i in seq_along(interactive_plot$x$data)) {
if (!is.null(interactive_plot$x$data[[i]]$name)) {
interactive_plot$x$data[[i]]$name <- gsub(",1", "", interactive_plot$x$data[[i]]$name)
}
}
# Display the interactive plot
interactive_plot
Countries are grouped based on their employment patterns. By looking
at the plot I identified 4 macro-clusters:
1.
Turkey on the top
2. Euro Area 20
and European Union 27 on the right
3.
Germany in the middle
4. The remaining groups on
the lower-left
By taking a closer look at the group on the
lower-left (draw a rectangle on the plot to zoom) I saw that
Italy, France and
Spain were plotted closer to eachother and further away
from the other EU-countries. This may show that these countries share a
similar employment distributions.
EU27_2020_FOR, FOR,
NAT, etc.).NRP) as a
distinct group.This PCA analysis helps to:
NRP).Eurostat Dataset: lfsa_egan